Diffusion-Based Causal Representation Learning

Causal reasoning can be considered a cornerstone of intelligent systems. Having access to an underlying causal graph comes with the promise of cause–effect estimation and the identification of efficient and safe interventions. However, learning causal representations remains a major challenge, due to the complexity of many real-world systems. Previous works on causal representation learning have mostly focused on Variational Auto-Encoders (VAEs). These methods only provide representations from a point estimate, and they are less effective at handling high dimensions. To overcome these problems, we propose a Diffusion-based Causal Representation Learning (DCRL) framework which uses diffusion-based representations for causal discovery in the latent space. DCRL provides access to both single-dimensional and infinite-dimensional latent codes, which encode different levels of information. In a first proof of principle, we investigate the use of DCRL for causal representation learning in a weakly supervised setting. We further demonstrate experimentally that this approach performs comparably well in identifying the latent causal structure and causal variables.


Introduction
Causal representation learning consists of uncovering a system's latent causal factors and their relationships from observed low-level data. It finds applicability in domains such as autonomous driving [1], robotics [2], healthcare [3], climate studies [4], epidemiology [5,6], and finance [7]. Furthermore, recent advancements in Large Language Models (LLMs) underscore the growing importance of studying causal representation learning in this domain [8–10]. In these tasks, the underlying causal variables are often unknown, and we only have access to low-level representations.
Causal representation learning is a challenging problem. In fact, identifying latent causal factors is generally impossible from observational data alone. There has been an ongoing effort to study sets of assumptions that ensure the identifiability of causal variables and their relationships [1,11–17]. These approaches consider the availability of additional information, or they use assumptions on the underlying causal structure of the data-generating process (DGP). However, many of these assumptions, such as Causal Faithfulness [18], cannot be verified. It is nevertheless possible to identify latent causal factors from a combination of observational and interventional data. Brehmer et al. [14] consider a weak form of supervision, in which we have access to data pairs corresponding to the state of the system before and after a random unknown intervention. They prove that, in this weakly supervised setting, the structure and the causal variables are identifiable up to a relabeling and elementwise reparameterization.
The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 covers the background on causality. The background on diffusion models and diffusion-based representations is outlined in Section 4. Section 5 outlines the addressed problem, the weakly supervised framework, and the identifiability conditions. Section 6 details the proposed DCRL framework. Experimental results are presented in Section 7. Finally, Section 8 concludes the paper and suggests potential future research directions.

Related Work

Diffusion-Based Representation Learning
Learning representations with diffusion models remains a relatively unexplored area. Several works have tried to train an external module (e.g., an encoder) along with the score function of the diffusion model to extract representations. Abstreiter et al. [43] and Mittal et al. [44] condition the score function of a diffusion model on a time-independent and a time-dependent encoder and obtain finite- and infinite-dimensional representations, respectively. Wang et al. [45] use the same conditioning but regularize the objective function with the mutual information between the input data and the learned representations. Traub [48] performs the same conditioning but uses Latent Diffusion Models [54], where the inputs of the diffusion model are latent variables obtained by applying a pretrained autoencoder to the input. Furthermore, Kwon et al. [46] propose an asymmetric reverse process that discovers the semantic latent space of a frozen diffusion model, where modifications in this space synthesize various attributes on input images. However, in principle, diffusion models lack a semantic latent space, and it is unclear how to efficiently learn representations using their capabilities.

Causal Representation Learning
Given the inherent challenges of identifiability in causal representation learning, many previous studies have tackled this issue by imposing certain assumptions on the dataset or the causal structure. Several previous methods rely on additional knowledge of the data generation process, such as knowledge of the causal graph or labels for high-level causal variables. CausalGAN [55] requires the structure of the underlying causal graph to be known. Yang et al. [11] and Liu et al. [12] assume a linear structural equation model, and they require additional information associated with the true causal concepts as supervising signals. Similar to Yang et al. [11], Komanduri et al. [56] assume the availability of supplementary supervision labels but without requiring mutual independence among factors. Von Kügelgen et al. [57] investigate self-supervised causal representation learning by utilizing a known, but non-trivial, causal graph between content and style factors. Subramanian et al. [13] apply Bayesian structure learning in the latent space and rely on having interventional samples. Sturma et al. [58] consider a setup with access to data from multiple domains that share a causal representation. Buchholz et al. [59] assume that the latent distribution is Gaussian and that samples from unknown single-node interventions are available. Additionally, Ahuja et al. [15] analyze various scenarios and the level of identifiability in the presence of interventional data. For an overview of causal representation learning, we refer to Schölkopf et al. [1].
Furthermore, there have been recent works on utilizing diffusion models in causality. Specifically, Sanchez and Tsaftaris [60] focus on counterfactual estimation from observational imaging data given a known causal structure. Similarly, Sanchez et al. [61] aim to learn the underlying SCM in the low-level data space assuming a non-linear additive noise model, which is identifiable. However, both of these works focus on the SCM in the data space, while our approach focuses on learning the SCM in the latent space among the underlying latent variables in a weakly supervised setting. Other relevant work closely related to causal representation learning includes disentangled representations and independent component analysis [62–66].

Structural Causal Model
Following Pearl [67] and Bongers et al. [68], we describe the data-generating process (DGP) using the notion of structural causal models. A structural causal model (SCM) is a formal framework used to represent and analyze causal relationships among variables within a system. An SCM essentially consists of a set of random variables and measurable functions between them that specify the underlying causal relationships of the DGP. We formally define SCMs as follows.
Definition 1 (Structural Causal Model (SCM), Definition 2.1 by Bongers et al. [68]). A structural causal model (SCM) is a tuple ⟨L, J, E, Z, f, µ⟩, where (i) L is a finite index set of endogenous variables; (ii) J is an index set of exogenous variables, which is disjoint from L; (iii) E = ∏_{j∈J} E_j is the product of the domains of the exogenous variables, where each E_j is a measurable space; (iv) Z = ∏_{j∈L} Z_j is the product of the domains of the endogenous variables, where each Z_j is a measurable space; (v) f : Z × E → Z is a measurable function that specifies the causal mechanism; and (vi) µ = ∏_{j∈J} µ_j is a product measure, where µ_j is a probability measure on E_j for each j ∈ J.
In the definition above, the functional relationships between variables are expressed in terms of a function f. This feature allows us to model the cause-effect relationships of the data-generating process (DGP) using structural equations. Structural equations are mathematical representations used to describe causal relationships among variables in a system. They express how one or more variables causally influence others within a causal graphical model. For a given SCM as above, a structural equation specifies an endogenous random variable z_l via a measurable function of the form z_l = f_l(z, e), where z ∈ Z and e ∈ E. This function essentially captures the deterministic relationships specified by f as in Definition 1. A parent i ∈ L ∪ J of l is any index for which there is no measurable function k : ∏_{j∈L\{i}} Z_j × E → Z_l with f_l = k almost surely. Intuitively, each endogenous variable z_l is specified by its parents together with the exogenous variables via the structural equations.
A structural causal model as in Definition 1 can be conveniently described with the causal graph, a directed graph of the form G = (V, E). The nodes of the causal graph consist of the entire set of indices for the endogenous variables, and the edges are specified by the structural equations, i.e., {j → l} ∈ E if and only if j is a parent of l. Note that the variables in the set pa(z_l) are indexed by the parent nodes of l in the corresponding graph G. An example of a causal diagram is given in Figure 1 (left).

Solution Functions. An alternative way of defining SCMs replaces causal mechanisms with solution functions h : E → Z, which map exogenous noise variables to endogenous causal variables, i.e., z_i = h_i(e) with e ∈ E, and are defined by successively applying the causal mechanisms f. Solution functions contain the same information as causal mechanisms, and each can be derived from the other. We utilize this formulation in our framework.
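The relation between causal mechanisms and solution functions can be made concrete with a small sketch. The three-variable linear SCM below, its weights, and the function names `mechanisms` and `solution` are illustrative assumptions, not part of the paper; the sketch only shows that repeatedly applying f yields h for an acyclic model.

```python
import numpy as np

# Hypothetical 3-variable linear SCM (illustrative only).
# W[j, l] is the weight of edge j -> l; strictly upper triangular means acyclic.
W = np.array([[0.0, 2.0, -1.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

def mechanisms(z, e):
    """Causal mechanism f: z_l = sum_j W[j, l] * z_j + e_l."""
    return z @ W + e

def solution(e):
    """Solution function h: E -> Z, built by applying f once per variable."""
    z = np.zeros_like(e)
    for _ in range(len(e)):
        z = mechanisms(z, e)
    return z

e = np.array([1.0, 0.0, -1.0])   # one draw of the exogenous noise
z = solution(e)                  # endogenous variables implied by e
```

For a linear acyclic model the same result is available in closed form as e (I − W)^{-1}, which is one way to see that f and h carry the same information.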
Interventions. A very important aspect of SCMs is that they allow us to reason about cause-effect relationships using interventions. Interventions refer to deliberate changes or manipulations made to one or more variables within the model to study their causal effects on other variables. In this paper, we specifically consider perfect interventions [67]. For a given SCM as in Definition 1, consider a variable W := ∏_{j∈L′} Z_j for a set L′ ⊆ L, and let w := ∏_{j∈L′} w_j be a point of its domain. The perfect intervention W ← w amounts to replacing the structural equations z_j = f_j(z, e) with the constant functions z_j ≡ w_j for all j ∈ L′. We denote by z | do(w) the variables z after performing the intervention. This procedure defines a new probability distribution p_z(z | do(w)), which we refer to as the interventional distribution. This distribution answers the following question: if we apply do(w), what will be the value of z? We extend this definition by defining I as the set of interventions entailed by w, and we utilize this formulation in our framework. An example of a causal graph and a single perfect intervention is depicted in Figure 1.
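A perfect intervention can be sketched on a toy model as follows. The linear SCM, the 0-based variable indices, and the helper `solve` are assumptions made for illustration; the only point is that the intervened structural equation is replaced by a constant while all other equations stay unchanged.

```python
import numpy as np

# Hypothetical linear SCM; W[j, l] is the weight of edge j -> l.
W = np.array([[0.0, 2.0, -1.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

def solve(e, do=None):
    """Evaluate the SCM in topological order 0, 1, 2.

    `do` maps a variable index j to the constant w_j that replaces its
    structural equation (a perfect intervention).
    """
    do = do or {}
    z = np.zeros(len(e))
    for l in range(len(e)):
        z[l] = do[l] if l in do else z @ W[:, l] + e[l]
    return z

e = np.array([1.0, 0.0, -1.0])
z_obs = solve(e)               # no intervention
z_int = solve(e, do={1: 5.0})  # variable 1 is set to 5 and ignores its parent
```

With the same noise e, only the intervened variable and its descendants change, which mirrors the paired pre- and post-intervention samples used later in the weakly supervised setting.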
Equivalence of SCMs. We now define the concept of equivalence between structural causal models. Two SCMs are structurally equivalent if their respective sets of structural equations and exogenous variables are equivalent. Formally, the notion of equivalence is defined as follows.
Definition 2. Consider two SCMs ⟨L, J, E, Z, f, µ⟩ and ⟨L′, J′, E′, Z′, f′, µ′⟩, with respective causal graphs G and G′. An isomorphism between the two SCMs consists of the following: (A) A graph isomorphism σ : L → L′. (A graph isomorphism is a bijection from the vertices of G to the vertices of G′ such that there exists an edge σ(i) → σ(j) in G′ if and only if there exists an edge i → j in G.) (B) Measure-preserving invertible functions l_j : Z_j → Z′_{σ(j)} such that the function l(z) := ∏_{j∈L} l_j(z_j) yields f′(l(z), e) = l(f(z, e)) for all z ∈ Z, e ∈ E. (A measure-preserving function l : A → B ensures that the probability distribution in the domain space A remains the same when mapped to the co-domain space B through the function l.) We say that two SCMs are equivalent if their domains are identical and such an isomorphism exists between them.
Definition 2 ensures that the causal mechanisms of equivalent SCMs are essentially identical. The functions l_j in Definition 2 reparameterize the random variables in both models such that the structural equations and causal relationships are preserved.

Diffusion Models

Overview
The fundamental concept behind diffusion-based generative models is to learn to generate data by inverting the diffusion process. Diffusion models comprise two processes: a forward process and a backward process. The forward process gradually adds noise to data and maps data to (almost) pure noise. The backward process, on the other hand, is used to go from a noise sample back to the original data space.
The forward process is defined by a stochastic differential equation (SDE) across a continuous time domain t ∈ [0, 1], aiming to transform the data distribution to a known prior distribution, typically a standard multivariate Gaussian. Given x_0 sampled from a data distribution p(x_0), the forward process constructs a trajectory (x_t)_{t∈[0,1]} across the time domain. We utilize the Variance Exploding SDE [53] for the forward process, which is defined as:

$$\mathrm{d}x = \sqrt{\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}w,$$

where w is the standard Wiener process, and σ²(t) is the noise variance of the diffusion process at time t. The backward process is also formulated as an SDE in the following manner:

$$\mathrm{d}x = -\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}\,\nabla_x \log p_t(x)\,\mathrm{d}t + \sqrt{\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}\bar{w},$$

where w̄ is the standard Wiener process in reverse time.

Score matching. To use this backward process, the score function ∇_x log p_t(x) is required. It is usually approximated by a neural score function s_θ(·), which can be trained by Explicit Score Matching [69], defined as:

$$\mathcal{L}_{\mathrm{ESM}} = \mathbb{E}_{t}\Big\{\lambda(t)\,\mathbb{E}_{x_t}\big[\lVert s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t)\rVert_2^2\big]\Big\},$$

where λ(t) is a positive weighting function. However, the ground-truth score function ∇_x log p_t(x) is generally not known. Vincent [70] addresses this issue by proposing Denoising Score Matching. The approximate score function is then learned by minimizing the loss function:

$$\mathcal{L}_{\mathrm{DSM}} = \mathbb{E}_{t}\Big\{\lambda(t)\,\mathbb{E}_{x_0}\mathbb{E}_{x_t \mid x_0}\big[\lVert s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t \mid x_0)\rVert_2^2\big]\Big\},$$

where the conditional distribution of x_t given x_0 is p_t(x_t | x_0) = N(x_t; x_0, [σ²(t) − σ²(0)] I). This objective function originates from the Evidence Lower Bound (ELBO) of the data distribution, and it has been shown that with a specific weighting function, this objective function becomes exactly a term in the ELBO [53]. For more details, see Appendix B.
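As a rough illustration of Denoising Score Matching under the Gaussian conditional p_t(x_t | x_0) = N(x_t; x_0, [σ²(t) − σ²(0)] I), the following sketch computes a one-sample estimate of the loss. The geometric noise schedule, the constants `SIGMA_MIN` and `SIGMA_MAX`, the choice λ(t) = σ²(t) − σ²(0), and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 50.0  # assumed schedule endpoints

def sigma(t):
    """Assumed geometric noise schedule sigma(t) for t in [0, 1]."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def perturb(x0, t, rng):
    """Sample x_t ~ N(x0, [sigma^2(t) - sigma^2(0)] I); requires t > 0."""
    std = np.sqrt(sigma(t) ** 2 - sigma(0.0) ** 2)
    return x0 + std * rng.standard_normal(x0.shape), std

def dsm_loss(score_fn, x0, t, rng):
    """One-sample Monte Carlo estimate of the DSM loss at time t."""
    xt, std = perturb(x0, t, rng)
    target = -(xt - x0) / std ** 2      # grad of log p_t(x_t | x_0) w.r.t. x_t
    weight = std ** 2                   # assumed weighting lambda(t)
    return weight * np.sum((score_fn(xt, t) - target) ** 2)
```

The target is available in closed form because the perturbation kernel is Gaussian, which is what makes the denoising objective usable when the marginal score is unknown.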

Diffusion-Based Representations
Conditional Score Matching. We can modify Denoising Score Matching so that the score function receives additional information through an external trainable module. This results in a conditional diffusion model that allows us to perform representation learning while training the score function. Abstreiter et al. [43] propose conditional Denoising Score Matching, defined as:

$$\mathcal{L}_{\mathrm{CDSM}} = \mathbb{E}_{t}\Big\{\lambda(t)\,\mathbb{E}_{x_0}\mathbb{E}_{x_t \mid x_0}\big[\lVert s_\theta(x_t, t, E_\phi(x_0)) - \nabla_{x_t}\log p_t(x_t \mid x_0)\rVert_2^2\big]\Big\}, \tag{1}$$

where the score function is conditioned on a module E_ϕ(x_0), which provides additional information about the data to the diffusion model through a learned encoder with parameters ϕ. In fact, the encoder learns to extract the information from x_0, in a reduced-dimensional space, that helps recover x_0 by denoising x_t. Abstreiter et al. [43] also present an alternative objective where the encoder is a function of time. Formally, the new objective is

$$\mathcal{L}_{\mathrm{CDSM}} = \mathbb{E}_{t}\Big\{\lambda(t)\,\mathbb{E}_{x_0}\mathbb{E}_{x_t \mid x_0}\big[\lVert s_\theta(x_t, t, E_\phi(x_0, t)) - \nabla_{x_t}\log p_t(x_t \mid x_0)\rVert_2^2\big]\Big\}. \tag{2}$$

With this objective, the encoder learns a representation trajectory of x_0 instead of a single representation. Training this system has the potential to minimize the objective to zero, motivating the encoder E_ϕ(·) to learn meaningful, distinct representations at different timesteps [43,44].
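The difference between a time-independent code and a time-dependent code can be sketched with a toy encoder. The linear encoder, the parameter matrices, and the helper names below are hypothetical stand-ins for the learned networks; the sketch only shows how the score function receives the code as an extra input and how a trajectory of codes arises when the encoder also takes t.

```python
import numpy as np

def encoder(phi, x0, t=None):
    """Hypothetical linear encoder; pass t to get the time-dependent variant."""
    inp = x0 if t is None else np.append(x0, t)
    return phi @ inp

def conditional_dsm(score_fn, code, x0, t, std, rng):
    """One-sample conditional DSM term; `std` is the perturbation scale at t."""
    xt = x0 + std * rng.standard_normal(x0.shape)
    target = -(xt - x0) / std ** 2
    return std ** 2 * np.sum((score_fn(xt, t, code) - target) ** 2)

rng = np.random.default_rng(0)
x0 = np.ones(4)
phi_single = rng.standard_normal((2, 4))   # code depends on x0 only
phi_traj = rng.standard_normal((2, 5))     # code depends on (x0, t)

single_code = encoder(phi_single, x0)
trajectory = np.stack([encoder(phi_traj, x0, t) for t in np.linspace(0, 1, 11)])
```

Here `single_code` is one vector per input, while `trajectory` holds one code per sampled timestep, which is the discretized analogue of the infinite-dimensional representation.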
Comparison with Other Generative Models. The key difference between other generative models and diffusion-based representations is that other generative models are only concerned with one finite code, and all the information is encoded into this single code, while in the latter, different levels of information are encoded along an infinite-dimensional code, i.e., the encoder is conditioned on time t and produces a trajectory-based representation (E_ϕ(x_0, t))_{t∈[0,1]}. Within this representation, various points along the trajectory contain different levels of information, as highlighted by Mittal et al. [44]. In this work, we first explore a time-independent single code, where we employ Equation (1) and show that with a certain weighting function, this objective function becomes the ELBO. Then, we apply the same experiments with an infinite-dimensional latent code (Equation (2)) and study the benefits and implications of these formulations for causal representation learning.

Problem Formulation
We consider a system that is described by an unknown underlying SCM on the latent causal variable z, where we have access to low-level data pairs (x_0, x̃_0) ∼ p(x_0, x̃_0) representing the system before and after a random, unknown, and atomic intervention. We consider the assumptions and the data-generation process described in Section 5.1. Our objective is to learn an SCM that accurately represents the true underlying SCM associated with the given data, up to a permutation and elementwise reparameterization of causal variables and solution functions. To this end, we train an SCM by maximizing the likelihood of the data. With sufficient data and perfect optimization, we can find an SCM that is equivalent to the ground-truth SCM.

Weakly Supervised Framework
We build our weakly supervised framework on the assumptions and identifiability conditions established by Brehmer et al. [14]. We aim to learn the underlying SCM over unknown latent causal variables z of a system in which low-level information x_0 ∈ X, generated directly from z through an unknown function g : Z → X, is available. Following Brehmer et al. [14] and Locatello et al. [26], we consider a dataset that consists of paired datapoints (x_0, x̃_0), generated as follows:

Since the intervention is perfect, the solution function changes such that the dependency between the latent causal variable z_I and its parents is removed only for the intervened variable. For the complete list of assumptions, see Appendix A.
Brehmer et al. [14] prove that, under this weakly supervised setting, it is possible to identify the latent causal variables and solution functions up to a permutation and elementwise reparameterization of the variables. We refer to their work for the proof of the identifiability of the described system.

Non-Identifiability from Observational Data
In this section, we show that interventions are necessary for identifiability in this setting. In fact, note that Definition 2 implies that the distributions of two equivalent SCMs are the same, up to a measure-preserving invertible function. However, two SCMs may entail the same observational distribution on the generated data, even if their respective causal mechanisms are not equivalent. This is best illustrated with an example. Consider two datasets {X_1, Y_1} and {X_2, Y_2}. The respective DGPs are:

where the covariance matrix Σ is defined as

Note that both datasets {X_1, Y_1} and {X_2, Y_2} entail the same observational distribution. However, these datasets have different causal mechanisms. In particular, their respective causal diagrams are not isomorphic. Hence, the same observational distribution may entail different causal diagrams. This means that the causal dynamics of an SCM cannot be inferred from the distribution of a given observational dataset, i.e., SCMs are unidentifiable from observational data.
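The same phenomenon can be checked numerically on a separate toy case. The two bivariate linear Gaussian models below (X → Y and Y → X), their coefficients, and the helper `implied_cov` are illustrative assumptions and are not the example used in the paper; they are chosen so that both models imply the same zero-mean Gaussian distribution over (X, Y).

```python
import numpy as np

def implied_cov(W, noise_var):
    """Covariance of z = z W + e for a linear Gaussian SCM (row-vector form).

    W[j, l] is the weight of edge j -> l; noise_var holds Var(e_l).
    """
    A = np.linalg.inv(np.eye(len(W)) - W)   # z = e A
    return A.T @ np.diag(noise_var) @ A

# Variable order is (X, Y) in both models.
# Model 1, X -> Y:  X = e_X,           Y = X + e_Y,  Var(e_X) = 1,   Var(e_Y) = 1.
cov_xy = implied_cov(np.array([[0.0, 1.0], [0.0, 0.0]]), [1.0, 1.0])
# Model 2, Y -> X:  X = 0.5 Y + e_X,   Y = e_Y,      Var(e_X) = 0.5, Var(e_Y) = 2.
cov_yx = implied_cov(np.array([[0.0, 0.0], [0.5, 0.0]]), [0.5, 2.0])
```

Both covariance matrices equal [[1, 1], [1, 2]], so the observational distributions coincide. The models differ under intervention: setting X to a constant shifts Y in the first model and leaves Y untouched in the second.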

Limitations
While our goal is to execute a robust and informative study to address the selected research question, it is important to acknowledge inherent limitations related to data, model assumptions, and evaluations. First, our evaluation is limited to synthetic datasets in a single modality. Furthermore, we consider the weakly supervised data-generation process and assumptions for the identifiability of the underlying model, which may limit the practical application of our work in systems where the assumptions do not hold. Finally, the representation learning process relies on an encoder, which acts as an information channel, regulating the amount of input information transmitted to the score function during each step of the diffusion process. It is important to note that in certain scenarios, the encoder may not be essential to the diffusion process and could potentially result in collapsing behavior. However, it is important to emphasize that our work is a preliminary step towards utilizing diffusion models for causal representation learning and lays the foundation for significant further research in this area.

The DCRL Framework
Figure 2 provides a visual representation of the framework's architecture. In this study, we utilize a conditional diffusion model and apply it to the input data (x_0, x̃_0), where x_0, x̃_0 ∈ ℝ^{3×W×H} and W and H are the width and height of the input, respectively. We denote by (x_t)_{t∈[0,1]} the diffusion trajectory across the time domain with x_0 as the input data. The conditioning module is defined as the encoding module, generating high-level diffusion-based representations (e, ẽ) for each low-level data pair, where e, ẽ ∈ ℝ^d and d is the number of latent causal variables, which is assumed to be known. We empirically show that these latent variables contain information equivalent to that of the noise variables of the underlying SCM and can be used interchangeably. Then, we infer the intervention target I ∈ {0, 1, ..., d − 1} for each data pair with an intervention module and use neural solution functions on top of the latent variables (e, ẽ) and the intervention target I to obtain the underlying latent causal variables z, z̃ ∈ ℝ^d. We base our framework on the Implicit Latent Causal Model (ILCM) introduced by Brehmer et al. [14] and describe each part of our framework in the next paragraphs.

Conditional Diffusion Model
Based on the formulation described in Section 4, we use a conditional diffusion model. A stochastic encoder q(e | x_0) serves as the conditioning module, mapping the low-level data space to the high-level latent space. When employing a finite code where the stochastic encoder is independent of time, e is a single vector of size d. In this case, the framework learns a single SCM. Alternatively, when using an infinite-dimensional latent code, the stochastic encoder generates (e_t)_{t∈[0,1]}, which is a trajectory-based representation across time. At each timestep t, e_t ∈ ℝ^d represents a single point of the trajectory. In this scenario, the framework learns an SCM at each timestep. In the following paragraphs, for the sake of simplicity, we use the single-code formulation.

The Encoding and the Intervention Module
The encoding module consists of two main parts: the stochastic encoder and the projection module. The stochastic encoder q(e | x_0) maps data pairs (x_0, x̃_0) to pre-projection latent variables (e, ẽ). The encoded inputs are then utilized in the intervention module q(I | x_0, x̃_0) to infer the intervention target I for the data pair (x_0, x̃_0). Based on our data generation process in Section 5.1, the encoded inputs have the property that e_i ≠ ẽ_i only for the elements that are intervened upon, i ∈ I, while the rest remain the same. Based on this property, in order to infer interventions, we employ an intervention module q(I | x_0, x̃_0), which is defined heuristically as

where µ_e(x_0) is the mean of the stochastic encoder q(e | x_0); α, β, and γ are learnable parameters; and Z is a normalization constant. This simple heuristic function ensures that a variable has a higher chance of being selected as the intervened variable if it undergoes more significant changes in response to the intervention. Once the intervention is inferred from the pre-projection latent variables, we apply the projection module. Similar to Brehmer et al. [14], the projection module depends on the inferred intervention target I and projects the encoded input (e, ẽ) to new latent variables such that, for the components e_i that are not intervened upon (i ∉ I), the pre-intervention and post-intervention latent components are equal, e_i = ẽ_i. This prevents the framework from deviating from the weakly supervised structure.
We write the combination of the encoder and the projection module as q(e, ẽ | x_0, x̃_0, I), and refer to it as the encoding module. By this definition, the encoding module q(e, ẽ | x_0, x̃_0, I) maps the input (x_0, x̃_0) to latent variables (e, ẽ), and the intervention module infers the intervention I based on the pre-projection latent variables.

Prior
Given the intervention target I and latent variables (e, ẽ), we define the prior p(e, ẽ, I) as p(e, ẽ, I) = p(I) p(e) p(ẽ | e, I). The objective of the prior distribution is to implicitly capture the causal structure and causal mechanisms within the system. Specifically, p(I) and p(e) denote the prior distributions over intervention targets and latent variables, respectively; p(I) is a uniform categorical distribution with each latent variable as a category, and p(e) is a standard Gaussian distribution. According to our data generation process, when an intervention is applied, only the elements of the latent variables that are intervened upon are altered; the other elements remain unchanged and independent of each other. Consequently, we can define p(ẽ | e, I) as follows:

$$p(\tilde{e} \mid e, I) = \prod_{i \notin I} \delta(\tilde{e}_i - e_i) \prod_{i \in I} p(\tilde{e}_i \mid e).$$

In this equation, δ(·) is the Dirac delta function, which enforces this property for the non-intervened latent variables.
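Sampling from a prior with this factorization can be sketched as follows, assuming a single intervention target per pair. The function name `sample_prior` and the `cond_sampler` argument are hypothetical; in particular, `cond_sampler` is only a placeholder for the learned conditional density of the intervened component.

```python
import numpy as np

def sample_prior(d, rng, cond_sampler):
    """Draw (e, e_tilde, target) from p(I) p(e) p(e_tilde | e, I)."""
    target = rng.integers(d)          # p(I): uniform over the d latent variables
    e = rng.standard_normal(d)        # p(e): standard Gaussian
    e_tilde = e.copy()                # non-intervened entries are copied exactly
    e_tilde[target] = cond_sampler(e, target, rng)  # intervened entry is redrawn
    return e, e_tilde, target

rng = np.random.default_rng(0)
# Placeholder conditional: redraw the intervened entry from a standard normal.
e, e_tilde, target = sample_prior(5, rng, lambda e, i, rng: rng.standard_normal())
```

Copying the non-intervened entries is the sampling counterpart of the Dirac delta terms: pre- and post-intervention latents may differ only at the intervention target.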

Neural Solution Functions
In order to encode the information about the intervened variables, we incorporate a conditional normalizing flow p(ẽ_i | e) defined as

where h(·) are the solution functions of the SCM. They are defined as invertible affine transformations with parameters learned by neural networks. Therefore, by learning solution functions, i.e., learning to transform e to z, we implicitly model the causal graph within the framework and obtain the latent causal variables. For more details about the implementation, see Appendix C.

The Evidence Lower Bound for DCRL
We derive the Evidence Lower Bound (ELBO) for the framework described in the previous section. In the case of single-point representations, in which the noise variable e is independent of time, the ELBO becomes:

+ log p(ẽ | e, I) − log q(I | x_0, x̃_0) − log q(e, ẽ | x_0, x̃_0, I),

where λ(t) is a positive weighting function, and β = 1. We train the model by minimizing a reweighted loss function reminiscent of β-VAEs, initially setting β to 0 and increasing it to 1 during training.
In the case of infinite-dimensional representations (Equation (2)), the objective function becomes:

+ log p(ẽ_t | e_t, I) − log q(I | x_0, x̃_0) − log q(e_t, ẽ_t | x_0, x̃_0, I),

where (e_t)_{t∈[0,1]} is the trajectory-based representation and e_t ∈ ℝ^d is the single point of the trajectory at time t. For a complete derivation of the ELBO, see Appendix B.
To prevent a collapse of the latent space to a lower-dimensional subspace, we add the negative entropy of the batch-aggregate intervention posterior as a regularization term to the loss function:

where E_batches[·] is the expected value over all the batches of data, and q_I^batch(I) is defined as

After training, the framework contains information about the underlying causal structure and latent causal variables, and it can be used in different downstream tasks.
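A negative-entropy regularizer of this kind can be sketched for a single batch as below. The aggregation by a plain batch mean, the small constant inside the logarithm, and the function name are assumptions made for illustration; the exact definition used in the paper is not reproduced here.

```python
import numpy as np

def negative_entropy_regularizer(q_int):
    """Negative entropy of the batch-aggregated intervention posterior.

    q_int has shape (batch, d); each row is a posterior q(I | x_0, x_tilde_0).
    The batch aggregate is assumed to be the mean over rows.
    """
    q_batch = q_int.mean(axis=0)
    return float(np.sum(q_batch * np.log(q_batch + 1e-12)))

# Every pair is assigned to the same target: entropy is zero, so no reward.
collapsed = np.tile([1.0, 0.0, 0.0], (4, 1))
# Targets are spread evenly over the three latents: entropy is maximal.
spread = np.eye(3)
```

Adding this term to a loss that is minimized favors batches whose inferred targets cover all latent dimensions, since the spread case yields a lower value (about −log 3) than the collapsed case (about 0).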

Experiments
Here, we analyze the performance of the proposed model, DCRL, on synthetic data. We employ DCRL for the task of causal discovery. After training DCRL, we use the framework to obtain causal variables (z, z̃) for the test set, and apply ENCO [71], a continuous optimization structure learning method that leverages observational and interventional data, to the obtained samples to infer the underlying causal graph. Furthermore, we evaluate the learned causal variables with the DCI framework [72].
Data Generation. In order to generate latent causal variables, we adopt random graphs, where each edge in a fixed topological order is sampled from a Bernoulli distribution with parameter 0.5. We consider the SCM to be linear Gaussian, and we sample the weights from a multivariate normal distribution with zero mean and unit variance. We make sure the weights are not close to zero to avoid violating the faithfulness assumption. We introduce additive Gaussian noise with equal variance across all nodes, set to 0.1. Latent causal variables are then sampled using ancestral sampling, and we generate 10^5 training samples, 10^4 validation samples, and 10^4 test samples. Finally, to generate input data x_0, we apply a random linear projection to the obtained latent variables. We keep the dimension of x_0 fixed to 16. We consider SCMs with 5, 10, and 15 variables. To enhance the robustness of the results, we generate data for 4 different seeds and repeat our experiments for each seed.
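The observational part of this procedure can be sketched as follows. The function name, the weight threshold `min_abs_w`, and the Gaussian projection matrix are assumptions, since the text does not state how weights are kept away from zero or how the projection is drawn; the generation of post-intervention pairs is omitted.

```python
import numpy as np

def sample_linear_gaussian_scm(d, n, x_dim=16, noise_var=0.1, min_abs_w=0.1, seed=0):
    """Sample a random linear Gaussian SCM and n observational datapoints."""
    rng = np.random.default_rng(seed)
    # Edge j -> l (j before l in a fixed order) is present with probability 0.5.
    adj = np.triu(rng.random((d, d)) < 0.5, k=1)
    # Standard normal weights, pushed away from zero (threshold is an assumption).
    w = rng.standard_normal((d, d))
    w = np.sign(w) * np.maximum(np.abs(w), min_abs_w)
    W = adj * w
    # Ancestral sampling: each z_l is a weighted sum of its parents plus noise.
    e = np.sqrt(noise_var) * rng.standard_normal((n, d))
    z = np.zeros((n, d))
    for l in range(d):
        z[:, l] = z @ W[:, l] + e[:, l]
    # Random linear projection from the latent space to the observed space.
    G = rng.standard_normal((d, x_dim))
    return adj, W, z, z @ G

adj, W, z, x = sample_linear_gaussian_scm(d=5, n=100)
```

Because the adjacency matrix is strictly upper triangular, iterating over the variables in index order visits every parent before its children, which is what ancestral sampling requires.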
Baselines. We consider ILCM [14] as our main baseline. To the best of our knowledge, there are no other methods that consider the same weakly supervised assumptions, and adapting other methods to our assumptions either substantially changes the method or is infeasible. We also evaluate the outcomes against a variation of the disentanglement VAE proposed by Locatello et al. [26], tailored for weakly supervised settings. This model, referred to as d-VAE, models the weakly supervised process but assumes unconnected factors of variation instead of causal relationships among variables. As with DCRL, we apply ENCO on top of both baselines to obtain the learned graph.
Metrics. We assess the performance of models with the following metrics:

• The Structural Hamming Distance (SHD) is a metric used to quantify the dissimilarity between two directed acyclic graphs (DAGs) by measuring the minimum number of edge additions, deletions, and reversals required to transform one graph into another. It is calculated by summing up the absolute differences between the entries of the adjacency matrices of the two graphs.

• The DCI Disentanglement Score is a metric used to evaluate the disentanglement quality of a generative model and takes values between 0 and 1. Disentanglement refers to the extent to which the model learns to predict the underlying factors of variation in the data in a way that each predicted variable captures at most one underlying factor. If a predicted factor is important to predict a single underlying factor, the score will be 1, and if a predicted factor is equally important to predict all the underlying factors, the score will be 0 [72].

• The DCI Completeness Score measures how well each underlying factor of variation is captured by a single predicted latent variable and has a value between 0 and 1. If a single variable contributes to one underlying factor, the score will be 1, and if all variables equally contribute to the prediction of a single factor, the score will be 0 [72].
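The entrywise computation of the SHD described in the first item can be sketched as below. The function name and the toy graphs are illustrative. Under this entrywise formula a reversed edge changes two adjacency entries and therefore contributes 2 to the distance.

```python
import numpy as np

def structural_hamming_distance(a, b):
    """Sum of absolute entrywise differences between two adjacency matrices."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    return int(np.abs(a - b).sum())

# True graph: 0 -> 1 -> 2.  Predicted graph: 0 -> 1 <- 2 (edge 1 -> 2 reversed).
true_graph = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]]
pred_graph = [[0, 1, 0],
              [0, 0, 0],
              [0, 1, 0]]
shd = structural_hamming_distance(true_graph, pred_graph)
```

An implementation that counts a reversal as a single operation would report 1 for this pair, so the convention should be fixed before comparing SHD values across methods.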

Single-Point Representations
Utilizing single-point representations, where e ∈ ℝ^d is independent of time, our method demonstrates superior or competitive performance compared to the baselines, as indicated by the metrics shown in Figure 3. The d-VAE performs poorly across all metrics, primarily because it assumes independent rather than causal relationships among variables. In scenarios involving 5 and 10 causal variables, ILCM shows comparable performance to DCRL, suggesting that a standard VAE can sufficiently capture essential information about causal factors. However, in higher dimensions, our method excels by capturing more detailed information about causal variables and their underlying structure. Our findings indicate that diffusion-based representations are more beneficial in higher dimensions, providing more accurate information about the underlying causal variables compared to other baseline methods.

Figure 3. Comparison of models on different metrics when using single-point representations. Our approach outperforms or competes favorably with the baseline methods on all metrics. Particularly in higher dimensions, our method excels by capturing additional information about the causal variables and the underlying causal structure.

Infinite-Dimensional Representations
In these experiments, we use infinite-dimensional representations to obtain a trajectory-based representation $(e_t)_{t \in [0,1]}$ for each input $x_0$. For inference, we sample points from this trajectory at intervals of 0.1, which yields 11 timesteps. The results are shown in Figure 4. In general, representations in the middle of the trajectory contain the most information and are comparable to, or even outperform, the baselines. Later in the trajectory, representations first appear to lose information and then improve again towards the end. This occurs because, at later timesteps, the noise level in the diffusion model during training is fairly high, and the conditioning module compensates by providing the information the diffusion model needs to learn the score function.
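The sampling of the trajectory at intervals of 0.1 can be sketched as follows. This is only an illustration: `trajectory_representations` and `toy_encoder` are hypothetical names, and the toy encoder is a stand-in for the trained time-conditioned encoder, whose interface is assumed here to be `encoder(x0, t)`.

```python
import numpy as np


def trajectory_representations(encoder, x0, step=0.1):
    """Evaluate a time-conditioned encoder on an evenly spaced grid in [0, 1]."""
    n_points = int(round(1.0 / step)) + 1      # step 0.1 gives 11 points
    timesteps = np.linspace(0.0, 1.0, n_points)
    reps = np.stack([encoder(x0, t) for t in timesteps])
    return timesteps, reps


def toy_encoder(x0, t):
    """Hypothetical stand-in for a trained encoder e_t = E(x0, t)."""
    return np.concatenate([x0 * (1.0 - t), [t]])


timesteps, reps = trajectory_representations(toy_encoder, np.zeros(4))
```

Each row of `reps` is one point $e_t$ of the trajectory and can be passed to the downstream causal model separately, which is how a per-timestep comparison would be produced.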

Conclusions
Identifying the underlying causal variables and mechanisms of a system solely from observational data is considered impossible without additional assumptions. In this work, we use weak supervision as an inductive bias and study whether the latent code of diffusion-based representations contains useful knowledge of the causal variables and the underlying causal graph.
This study is an initial exploration of applying diffusion models to causal representation learning and highlights the need for further research and extensions in this area. Our method relies on an external encoder to provide the information the diffusion model needs to learn the score function. Future work could focus on more efficient ways of acquiring representations from diffusion models without external dependencies or conditioning. Extending the weakly supervised framework to higher dimensions and to other modalities, such as video or multi-view data, is another potential direction. Applying the proposed method to domains such as experimental design, reinforcement learning, and robotics, where independent actions can be considered interventions and the system's state before and after an action is observable, is a further promising avenue for research. Finally, the framework could be extended to other settings, such as dynamical systems, where the infinite-dimensional latent code corresponds to the system's state at different timesteps.
• $D_{\mathrm{KL}}\big(q(u_T \mid x_0)\,\|\,p(u_T)\big)$ is the prior matching term, which can similarly be defined so that it is constant.
• $\mathbb{E}_{u_t \mid x_0}\big[D_{\mathrm{KL}}\big(q(u_{t-1} \mid u_t, x_0, e)\,\|\,p(u_{t-1} \mid u_t, e)\big)\big]$ is a denoising matching term. This term is the origin of the different interpretations of score-based diffusion models.
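To illustrate why the prior matching term can be treated as a constant, the sketch below evaluates it in closed form under an assumed variance-preserving (DDPM-style) forward process, $q(u_T \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\,x_0, (1 - \bar{\alpha}_T) I)$, with a standard normal prior $p(u_T)$. The forward process, the linear noise schedule, and the function name are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np


def prior_matching_kl(x0, alpha_bar_T):
    """KL( N(sqrt(a) * x0, (1 - a) * I) || N(0, I) ) with a = alpha_bar_T."""
    d = x0.size
    mean_sq = alpha_bar_T * float(np.sum(x0 ** 2))   # squared norm of the mean
    var = 1.0 - alpha_bar_T                          # isotropic variance
    return 0.5 * (d * var + mean_sq - d - d * np.log(var))


# Assumed linear beta schedule with T = 1000 steps (a common DDPM default).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar_T = float(np.prod(1.0 - betas))

x0 = np.ones(8)
kl = prior_matching_kl(x0, alpha_bar_T)
```

The expression contains no trainable parameters, and under this schedule it is close to zero, so it does not affect optimization of the remaining terms.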

Figure 1. A causal graph before and after an intervention. Applying a perfect intervention on $z_3$ eliminates the dependencies between this node and its parents in the causal graph.
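The effect of a perfect intervention on the graph structure can be sketched with an adjacency matrix. The four-node graph below is a hypothetical example, not the graph drawn in Figure 1, and the convention that `adjacency[i, j] == 1` encodes an edge from node i to node j is an assumption of this sketch.

```python
import numpy as np


def perfect_intervention(adjacency, target):
    """Return a copy of the graph with every edge into `target` removed."""
    intervened = adjacency.copy()
    intervened[:, target] = 0    # cut the dependence on all parents
    return intervened


# Hypothetical graph: z1 -> z2, z1 -> z3, z2 -> z3, z3 -> z4 (0-based indices).
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

A_do = perfect_intervention(A, target=2)   # intervene on z3
```

Only the incoming edges of the intervened node are removed; its outgoing edge to z4 remains, so the intervention still propagates to descendants.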

$$e \sim p_e(e), \quad I \sim p_I(I), \quad z = h(e), \quad x_0 = g(z),$$
$$\tilde{e} \sim p_e(e \mid \mathrm{do}(e')) \ \text{with} \ e' \sim p_{e_I}(e'), \quad \tilde{z} = \tilde{h}_I(\tilde{e}), \quad \tilde{x}_0 = g(\tilde{z}),$$
where $e$ and $\tilde{e}$ are the exogenous noise variables of the underlying SCM, $h(\cdot)$ and $\tilde{h}_I(\cdot)$ are the solution functions before and after a single perfect intervention $I$, and $p_I(\cdot)$ is a prior on all possible values of atomic interventions such that $p_{e_I}(e') > 0$ for every possible atomic intervention. In this setting, $p_e(e \mid \mathrm{do}(e'))$ is defined such that the noise variable remains the same and changes only for the element that is intervened upon, i.e., $\tilde{e}_I = e' \neq e_I$ and $\tilde{e}_{\setminus I} = e_{\setminus I}$.
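The sampling scheme above can be illustrated with a toy example. The sketch below assumes a linear additive-noise SCM with standard normal noise, a uniform prior over atomic intervention targets, and a fixed nonlinear mixing for $g$; all of these choices, and the names `solve_scm` and `sample_pair`, are assumptions made for illustration and are not the setup used in the experiments.

```python
import numpy as np


def solve_scm(e, W, target=None):
    """Solution function: map noise e to causal variables z.

    W[i, j] is the weight of the edge z_i -> z_j. W is strictly upper
    triangular, so the index order is a topological order. If `target`
    is given, that node ignores its parents (perfect intervention).
    """
    z = np.zeros_like(e)
    for j in range(len(e)):
        from_parents = 0.0 if j == target else W[:j, j] @ z[:j]
        z[j] = from_parents + e[j]
    return z


def sample_pair(W, G, rng):
    """Sample one (x0, x0_tilde) pair and the hidden intervention target."""
    n = W.shape[0]
    e = rng.normal(size=n)              # e ~ p_e
    target = int(rng.integers(n))       # I ~ p_I, uniform here
    e_tilde = e.copy()
    e_tilde[target] = rng.normal()      # only the intervened noise changes
    z = solve_scm(e, W)                 # z = h(e)
    z_tilde = solve_scm(e_tilde, W, target=target)
    x0 = np.tanh(G @ z)                 # x0 = g(z), toy mixing
    x0_tilde = np.tanh(G @ z_tilde)
    return x0, x0_tilde, target, z, z_tilde


rng = np.random.default_rng(0)
W = np.triu(rng.normal(size=(4, 4)), k=1)   # random DAG weights
G = rng.normal(size=(6, 4))                 # fixed mixing matrix
x0, x0_tilde, target, z, z_tilde = sample_pair(W, G, rng)
```

In the weakly supervised setting only `x0` and `x0_tilde` would be observed; the target and the causal variables are returned here solely to make the construction easy to inspect. Variables that precede the target in the topological order are identical in `z` and `z_tilde`.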

Figure 2. Overview of our framework. Here, we have a paired image of a face before and after an intervention (the smile). The paired image is mapped to latent variables by a stochastic encoder. The intervention target is determined by applying the intervention encoder to these latent variables. To maintain the weakly supervised structure, the latent variables are projected into a new pair and then serve as the conditioning module for a conditional diffusion model. The projected latent variables are in fact diffusion-based representations of the input pair. Finally, they are used in neural solution functions together with the intervention target to obtain the latent causal variables.

Figure 3. Comparison of models on different metrics when using single-point representations. Our approach outperforms or competes favorably with the baseline methods on all metrics. Particularly in higher dimensions, our method excels by capturing additional information about the causal variables and the underlying causal structure.

Figure 4. Comparison of models on different metrics when using infinite-dimensional representations. From top to bottom, each row corresponds to experiments with 5, 10, and 15 causal variables, respectively. We sample points from the trajectory at intervals of 0.1, creating a total of 11 specific timesteps. Typically, representations in the middle of the trajectory carry the most information, often matching or surpassing the baseline performance. As we move further in time, representations seem to lose some information, but they improve as they approach the end of the trajectory. Furthermore, the framework performs worse than or on par with the baselines in lower dimensions but generally outperforms them in higher dimensions.